An Evaluation Exercise for Word Alignment
Abstract
This paper presents the task definition, resources, participating systems, and comparative results for the shared task on word alignment, which was organized as part of the HLT/NAACL 2003 Workshop on Building and Using Parallel Texts. The shared task included Romanian-English and English-French sub-tasks, and drew the participation of seven teams from around the world.

1 Defining a Word Alignment Shared Task

The task of word alignment consists of finding correspondences between words and phrases in parallel texts. Assuming a sentence-aligned bilingual corpus in languages L1 and L2, the task of a word alignment system is to indicate which word token in the corpus of language L1 corresponds to which word token in the corpus of language L2.

As part of the HLT/NAACL 2003 workshop on "Building and Using Parallel Texts: Data Driven Machine Translation and Beyond", we organized a shared task on word alignment, where participating teams were provided with training and test data, consisting of sentence-aligned parallel texts, and were asked to provide automatically derived word alignments for all the words in the test set. Data for two language pairs were provided: (1) English-French, representing languages with rich resources (20 million word parallel texts), and (2) Romanian-English, representing languages with scarce resources (1 million word parallel texts).

Similar to the Machine Translation evaluation exercise organized by NIST (http://www.nist.gov/speech/tests/mt/), two subtasks were defined, with teams being encouraged to participate in both subtasks:

1. Limited resources, where systems are allowed to use only the resources provided.
2. Unlimited resources, where systems are allowed to use any resources in addition to those provided. Such resources had to be explicitly mentioned in the system description.

Test data were released one week prior to the deadline for result submissions.
Participating teams were asked to produce word alignments in the common format specified below, and to submit their output by the submission deadline. Results were returned to each team within three days of submission.

1.1 Word Alignment Output Format

The word alignment result files had to include one line for each word-to-word alignment. Additionally, lines in the result files had to follow the format specified in Fig. 1. While the S|P and confidence fields overlap in their meaning, the intent of having both fields available is to enable participating teams to draw their own line on what they consider to be a Sure or Probable alignment. Both these fields were optional, with some standard values assigned by default.

Fig. 1: sentence_no position_L1 position_L2 [S|P] [confidence]

1.1.1 A Running Word Alignment Example

Consider the following two aligned sentences:

[English] <s snum=18> They had gone . </s>
[French] <s snum=18> Ils etaient alles . </s>

A correct word alignment for this sentence is

18 1 1
18 2 2
18 3 3
18 4 4

stating that all the word alignments pertain to sentence 18, the English token 1 (They) aligns with the French token 1 (Ils), the English token 2 (had) aligns with the French token 2 (etaient), and so on. Note that punctuation is also aligned: the English token 4 (.) aligns with the French token 4 (.).
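The result-file format described above can be read with a few lines of code. The following is a minimal sketch in Python, not the organizers' tooling. The names `AlignmentLink` and `parse_alignment_line` are hypothetical. The default values ("S" and 1.0) are assumptions made for illustration, since the text only says that some standard values were assigned by default. Token positions are treated as 1-based, as in the running example.

```python
from typing import NamedTuple


class AlignmentLink(NamedTuple):
    """One word-to-word alignment line (hypothetical container)."""
    sentence_no: int
    position_l1: int
    position_l2: int
    label: str         # "S" (Sure) or "P" (Probable)
    confidence: float


def parse_alignment_line(line, default_label="S", default_confidence=1.0):
    """Parse 'sentence_no position_L1 position_L2 [S|P] [confidence]'.

    The two defaults are assumptions for this sketch, not values
    stated in the task description.
    """
    fields = line.split()
    if not 3 <= len(fields) <= 5:
        raise ValueError(f"expected 3 to 5 fields, got {len(fields)}: {line!r}")
    sentence_no, pos_l1, pos_l2 = (int(f) for f in fields[:3])
    label, confidence = default_label, default_confidence
    # Either optional field may be omitted, so tell them apart by value.
    for extra in fields[3:]:
        if extra in ("S", "P"):
            label = extra
        else:
            confidence = float(extra)
    return AlignmentLink(sentence_no, pos_l1, pos_l2, label, confidence)


# The running example for sentence 18.
english = "They had gone .".split()
french = "Ils etaient alles .".split()
result_file = "18 1 1\n18 2 2\n18 3 3\n18 4 4"

links = [parse_alignment_line(row) for row in result_file.splitlines()]
# Positions are 1-based, so subtract 1 when indexing the token lists.
pairs = [(english[k.position_l1 - 1], french[k.position_l2 - 1]) for k in links]
```

Here `pairs` holds the aligned word pairs for sentence 18, including the punctuation pair. Distinguishing the two optional fields by their value rather than their position is a design choice of this sketch, made so that a line carrying only one of them still parses.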
Similar Resources
A Gold Standard for English-Swedish Word Alignment
Word alignment gold standards are an important resource for developing and evaluating word alignment methods. In this paper we present a free English–Swedish word alignment gold standard consisting of texts from Europarl with manually verified word alignments. The gold standard contains two sets of word aligned sentences, a test set for the purpose of evaluation and a training set that can be u...
Word to word alignment strategies
Word alignment is a challenging task aiming at the identification of translational relations between words and multi-word units in parallel corpora. Many alignment strategies are based on links between single words. Different strategies can be used to find the optimal word alignment using such one-to-one word links including relations between multi-word units. In this paper seven algorithms are ...
Improving Word Alignment Using Alignment of Deep Structures
In this paper, we describe differences between a classical word alignment on the surface (word-layer alignment) and an alignment of deep syntactic sentence representations (tectogrammatical alignment). The deep structures we use are dependency trees containing content (autosemantic) words as their nodes. Most other functional words, such as prepositions, articles, and auxiliary verbs are hid...
Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings
One of the most important problems in machine translation (MT) evaluation is to evaluate the similarity between translation hypotheses with different surface forms from the reference, especially at the segment level. We propose to use word embeddings to perform word alignment for segment-level MT evaluation. We performed experiments with three types of alignment methods using word embeddings. W...
Interactive Word Alignment for Language Engineering
In this paper we report ongoing work on developing an interactive word alignment environment that will assist a user to quickly produce accurate full-coverage word alignment in bitexts for different language engineering tasks, such as MT lexicons and gold standards for evaluation. The system uses a graphical interface, static and dynamic resources as well as machine learning techniques. We also...